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1. Introduction 

In this article we consider data-based evaluation of statistical models, where by a statistical 
model we mean a parametric family of distributions. We will denote a parametric model 
by A4, and an element of the model by Mg, for 6 in some parameter space 0, so that 
A4 = {Me;0 e O}. A collection of a set of probability models Aik, k = 1,2,... will 
constitute a model class, say M. We will focus on the problem of selecting one statistical 
model from the competing models in the class M = {Adk}-, where k is an index for the 
parametric model under consideration. In the example we consider in Section 4, M is the 
class of all finite mixtures of multivariate Gaussians, and for each k, Mk is the set of all 
fc-component multivariate Gaussian mixtures. A particular model element Mg G A^^^ is a 
fc-componcnt mixture with specific values for the parameters 6. Let us also use to denote 
the true distribution of the data. 

In general, model selection can be approached from two main philosophical perspectives: 
(1) testing-based model assessment and (2) parsimony-based model assessment. The testing- 
based approach involves testing, for each k, Hq : ^ A4k vs Hi : Ft E AAk+i- The 
Likelihood Ratio Test (LRT) is often used to build a testing-based model selection method. 
However, we do not favor a simple testing approach, because, in the spirit of George Box, 
our prior belief is that every restricted model is fiawed, in which case every model would be 
rejected by testing at some sample size. Under this belief, a suitable model selection theory 
should instead be based on the adequacy of the model approximation. 

In parsimony-based model selection, we define a selection criterion, which usually in- 
volves a term accounting for the goodness of fit and a term penalizing richer models (models 
with more parameters). Then we choose as the best model among a given subset of models, 
i.e. Mbest G M, the model which optimizes our chosen criterion. The information criterion 
based approaches such as AIC (Akaike, 1973), BIG (Schwarz, 1978) and other model selec- 
tion criteria based on approximations of Bayes factors (Haughton, 1988) all fall under this 
large class of parsimony-based model selection tools. These tools are usually designed under 
a specific probability structure or are designed to achieve specific goals. For example AIC 
is designed to achieve maximum predictive accuracy whereas the BIG criterion evaluates 
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the posterior probability of the competing models with specified priors. (Yang, 2005) 

In practice the most parsimonious model is determined by the model associated with 
minimum risk under the specific loss function. For example, in case of the AIC criterion, the 
loss function of the risk is the KuUback-Leibler (KL) distance. We will here consider model 
selection from the same basic framework but we will replace KuUback-Leibler distance with 
quadratic distance (Lindsay et al., 2007). We do so because the family of quadratic distances 
allows for a very straightforward estimation of risk when the distance itself is used as the 
loss function. As evident later in this paper, the resulting quadratic risk has many desirable 
properties especially for high-dimensional data and for model comparisons in a non-regular 
parametric setup. 

Before describing the quadratic risk we want to point out two underlying themes of this 
article. First, as our risk measure is in the spirit of AIC, we will draw parallel to AIC and 
wherever necessary, contrast our quadratic risk measures to AIC. Also, this research was 
originally motivated by the problem of model selection for high-dimensional data in the 
normal mixture model, a non-regular parametric class. We will use this important model 
to demonstrate and address pivotal mathematical and computational problems in model 
selection. 

1.1. Quadratic risia based model selection 

We will develop our model selection criteria within the following risk framework. The loss 
incurred in using a model element Mq (from a model A^) when the true distribution is Ft 
will be denoted by a{Fr,Me). The function a should measure in some meaningful fashion 
the price of using Me to approximate Fr. The risk of using the model A^, given a particular 
parameter estimator 0, will be defined to be 

Pa{Fr,M,m) = E{a{Fr,M^)), 

where the m on the left side indicates that the estimate 9 on the right side was based on 
a sample of size m, and the expectation is taken over the true distribution. We alert the 
reader that we are retaining m as a risk argument because we will consider the estimation 
of risk for values of m other than the actual data size n. Our goal here is to choose the 
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model that minimizes this risk, noting that the lowest risk model is likely to depend on 
m. That is, in this formal definition the problem is not to pick a true model, or even a 
closest-to-truc model, from M: it is to pick a model for which our estimated model clement 
Mg will be the most similar to Fr, on average, in samples of size m. Although models with 
many parameters are likely to be closer to the true model, the estimation of the model 
parameters creates extra variability that is penalized in the risk calculation. 

The AIC model selection criterion arises in this context as follows. If we assume both 
Mg and F-r to be continuous density functions, with densities me{x) and frix) respectively, 
the KuUback Leibler(KL) loss function is 

KL(F„A/e) - / log (^^) fr{x)dx. 
J \me[x)J 

Using twice the KuUback-Leibler as a loss function, we can split the risk into two pieces; 
PKL(A1,m) = -2E 

= -2E 



\og{mg{x))fr{x)dx + / \og{jT{x))fr{x)dx 



+ 2 log{Mx))frix)dx (1) 



log{mg{x))frix)dx 

Here again m represents the sample size used for the estimator 9 and we will think of it as 
a variable that changes the magnitude of the penalty function. Of course, the convention 
is to replace m with n, the actual sample size. The AIC criterion is based on estimation 
of the first term on the right in (1). This can be done because the second term does not 
depend on any of the models under consideration. As a consequence, one does not estimate 
the full risk of using a model, but only its relative risk, which in turn depends on the other 
models under consideration. That is, a model that is chosen as the best among a class of 
models will still have an uncertain amount of total risk, and so could in fact be a poor fit. 

Our quadratic risk is defined by using a quadratic distance (Lindsay et al., 2007) as 
the loss function a(.,.). A quadratic distance between the true distribution F^ and any 
proposed distribution G, where G can be either discrete or continuous, has the form 

dK{Fr,G) ^ jj KG{s,t)d{Fr - G){s)d{Fr - G){t), (2) 

where Kq is a kernel of appropriate dimensions (details in Section 2), that can be chosen 
by the user to meet specific goals. Note that we will allow the kernel to depend on G, the 
second argument in distance dK{Fr,G). 
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A familiar example of this type is the Pearson Chi-squared distance on a partitioned 
sample space. 

Example 1. Let Ai, ...tAc be a partition of tiie sample space S into C bins and let G 
be a target probability measure. Define the kernel by 

^ i[x e AMy e A,] 
KG{x,y)^^^ , (3) 

where I is the indicator function and G{Ai) = dG{x) . The resulting quadratic distance 
is the Pearson Chi-square, given by: 

^ [F.(AO-G(AOf 

h G{A,) ■ 
Wc will be using the Pearson Chi-squared example throughout the paper as a way to 
exemplify our distance calculations in the simplest possible setting. In the example we 
introduce in Section 4 we will be using a Gaussian kernel. 



1.2. Why quadratic risl< based model selection ? 

Why do we propose this new measures of model selection? Here wc discuss three key issues 
that played a role in the development of our methodology: 



1.2.1. Local vs global comparison of models 

Quadratic distance does not separate into two parts in the manner that KuUback Leibler 
docs (see (1)). This means that one must estimate the absolute risk for a model. In the 
process, however, we can learn whether the models provide a low level of absolute risk, 
whereas for AIC we have no assurance that any model, even the best one, fits well in an 
absolute sense. 

The absolute quadratic risk of a parametric model can be calibrated by comparing its 
risk with the risk of the empirical distribution function. The latter estimator has no lack 
of fit but, due to its flexibility, but will have the highest variability. As we will see in our 
simulations, its risk provides a excellent benchmark for model quality. In effect, failure 
to meet this standard means that either more model building is needed, or nonparametric 
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approaches should be used. In our apphcation section we provide examples where the global 
comparison provides crucial decision making information in model selection. 

1.2.2. The role of model regularity conditions 

The approximations used to arrive at AIC, BIC and most other information criterion depend 
strongly on regularity conditions (boundary conditions, nested parameter structure among 
others) some of which are violated in important model selection exercises. For example, 
the regularity problems for nested mixture models are well known (see Section 4). Similar 
problems occur in the structural equations model (BoUcn et al., 2006) and the multilevel 
model, where the boundary between nested models is irregular. All these makes the usual 
asymptotic expansions inappropriate. 

On the other hand, our method is based on global tests of goodness of fit. As a result the 
test statistics that are used have asymptotic expansions that do not depend on regularity 
assumptions for nested models and so have wider validity. 

1.2.3. Addressing computational challenges in high dimensions 

A primary goal of our research was to devise model selection methods that would be prac- 
tical for high dimensional problems. However, quadratic distances, such as the one we are 
proposing as the loss function of our risk, could require numerical integration of the same 
dimension as the data vector, which would defeat our purpose. For our mixture exam- 
ple we will show how to avoid this computational burden by using a "rational" kernel for 
the distance; this gives the needed integration in a closed form. A detailed description of 
construction of these kernels can be found in Lindsay ct al. (2007). 

An additional challenge in high dimensions is the construction of distances whose op- 
erating characteristics can be tuned to the dimension of the problem and the sample size. 
In a Chi-squarcd test this would be done by choosing the number and location of the bins. 
Lindsay et al. (2007) defined the degrees of freedom of a quadratic distance and discussed 
its use to select distances. The kernel we will use in the mixture example includes a "tuning 
parameter" , that allows us to select a suitable degrees of freedom for the distance and also 



allows us to analyze the data at multiple resolutions. 
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1 . 3. Description of Paper 

Section 2 provides a detailed description of our choice of the quadratic distance in defining 
the quadratic risk, the appropriateness of such a definition in the context of high dimensional 
datasets and the estimation of quadratic risk. In Section 3 we build model selection tools 
based on our quadratic risk. Next, in Section 4, we demonstrate how this model selection 
tool can be applied to select the number of components in a multivariate normal mixture. 
Application to real data and simulated datasets and comparison of our method with existing 
model selection tools arc described in Section 5. Section 6 contains a discussion and the 
Appendix provides proofs of results stated in the article. 

2. Quadratic Risk and estimation of Quadratic Risk 

This section provides the foundational details and the estimation techniques for quadratic 
risk. We will start by defining the quadratic distance between two arbitrary probability 
measures. Then, after deriving an unbiased estimator of the quadratic distance we will 
use it to build an unbiased cross-validation based estimator for the quadratic risk measure. 
Although analytically attractive, this cross-validation method is computationally very ex- 
pensive whenever parameter estimation is expensive, especially so for large datasets or 
when the dimensions are high. So, in the later part of this section wc will derive an AIC- 
like approximation to the the risk measure, which is largely based on the decomposition of 
our quadratic risk and the asymptotics of quadratic distance. The definitions and results 
on quadratic distance that are discussed in this section are described in greater detail in 
Lindsay et al. (2007) and Ray (2003). 

2. 1. Quadratic Distance: Definition and empirical estimate 

2.1.1. Quadratic Distance framework 

First we provide the details of construction of the class of quadratic distance, which we use 
as the loss function of our risk measures. We define dK{Fr,G) to be the distance between 



8 Ray and Lindsay 

the true density, and any proposed distribution G, provided that they are defined on 
the same sample space S. The building block for our distance will be K(s,t), a bounded 
symmetric kernel function on S x which is conditionally non-negative (CNND)| definite. 

Definition 1. Given a CNND Kq{s, t), possibly depending on G, the K-based quadratic 
distance between two probability measures Ft and G is defined as 

dK{Fr,G) ^ JJ KG{s,t) d{Fr ~ G){s)d{Fr - G){t). (4) 

Note that by allowing K to depend on G we no longer have an inner product space, but we 
retain non- negativity due to CNND. 



2.1.2. Estimation of Quadratic Distance 

We now focus on the empirical estimation of the quadratic distance. We will later use these 
results for the the estimation of quadratic risk. It will be useful to express our results in 
terms of the n x n dimensional empirical kernel matrix K of a data set xi, . . . , Xn, which 
we define to have ij^'^ element Kij = K{xi, xj). Crucial to the estimation is the concept of 
the centered kernel defined as follows: 

Definition 2. The G-centered kernel K, denoted by Kcen{G) defined as 

KceniG) [x, y) - K{x, y) - K{x, G) - K{G, y) + K{G, G), (5) 
where K{x, G) = J K{x, y)dG{y) and K{G, G) = // K{x, y)dG{x)dG{y). 
For example, the G-centered kernel for the Pearson Chi-squared kernel in (3) simplifies to 

^ l[x € AMv £ A,] 

Kcen{G)ix,y) = 1- (6) 

Using the G-centered kernel we can rewrite the distance in form of a [/-functional as 

dK (F„ G) = J J K,,^(G) {x. vWr [x)dF, iv). (7) 

^ K is CNND if JJ K{s,t) A(j{s)Aa{t) is nonnegative for all bounded signed measures cr, and if 
nonnegativity holds for all cr satisfying the condition J d(T(s) = 0. 
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But as Fr is unknown we will use the empirical cdf F to estimate dK{Fr,G). Using (7), 
for any fixed G, we arrive at the following two estimates for dxiF-n G). First, dK{F, G) = 
Kcen{G){PT •= is ^ X^-statistic (Serfling, 1980) which is known to provide a biased 
estimate for the ?7-functional in (7). It can be calculated in matrix form as l"^Kce„(G)l/n^. 
One can also construct an unbiased estimate using the corresponding JJ-statistic: 



Un = , ^ ,^ Y] V'-?^cen(G)(a:^»:a;j ) = ^ [l'^]Kce„(G) 1 " i''(]Kce„(G) )] , (8) 

n{n — 1) ^-^ ^ ' nln — 1 '■ ' ■' 

where tr denotes the usual matrix trace. Note that we will later use the notation traceQ{K) 
to refer to a functional version of the trace operation that is defined by 

,™ce„(A-) = )dG(.). (9) 

Note that for the Pearson Chi-squared distance, using the estimator Vn we get 

c 2 

Y,[f{A,)-G{A,)\ /g{A,), (10) 

i=l 

which equals the Pearson Chi-squared statistic divided by n, whereas the unbiased estimator 
using (8) gives 

C 



i=l 



[f{A,) - G(A,)] 7g(A,) - ^^-il. (11) 



Note that under the usual assumptions, n times (10) follows a Xc-i distribution, so its 
expectation is C — 1, whereas the unbiased estimator in (11) has expectation zero. 



2.2. Quadratic Risk definition and unbiased estimation 

Based on the quadratic distance we have already defined the quadratic risk of a parametric 
model M as 

Pd{Fr,M,m) = E{dK{Fr,Mg)). (12) 

Here 6 is an estimator of 0, based on a sample of size m. In our subsequent analyses we will 
assume that 9 is the maximum likelihood estimator of 6. Note also that Mg is the second 
argument in dxi-, so when we use a G-dependent kernel (see (4)) it will be playing the 
role of G. 
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Next we construct an unbiased estimator of the risk given in (12). We start by showing 
how to unbiasedly estimate the quadratic risk of the empirical distribution function, a 
nonparametric estimator of Ft- As mentioned earlier, comparison of nonparametric risk 
and parametric risk gives a global assessment of model quality. 



2.2.1. Unbiased estimation of empirical risk 

For simplicity of notation, we define the risk of the nonparametric fit, based on m observa- 
tions, by Rm- A straightforward calculation shows that 

Rm = E[dK{Fr,F)] ^ —traceF^{Kcen(F^))- (13) 

For example, for the Pearson Chi-squared example the above risk can be calculated as: 

c 



m 

2—1 



For most other kernels we will need to estimate Rm because it will depend on Ft . Provided 
that K does not depend on G, we can compute an unbiased estimator of trace{Kcen(F^))i 
based on a sample of size n using 

1 " 1 

-^K{x,,Xi)- — -^^A'(a;„a;,). (14) 

This immediately gives us the unbiased estimator of i?,,,, as 

1 1 " 1 

m Ln ^-^ n{n — 1) ^-^ ^-^ J 

i=l ^ ' i j^i 

Note that R„i ^' as n — > oo and if we set m ^ n, then i?„ = Op{l/ri). Hence any 
method of systematically selecting models with smaller estimated risk than i?„ will give 
risk estimates converging to zero. 



2.2.2. Unbiased estimation of risk of parametric models 

We next build an unbiased estimator of the risk of the parametric model M = {Me}, which 
is given in (12) This risk can be estimated unbiasedly at any sample size m < n — 2 as 
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follows: Let Am be a randomly selected subset of size m from {1, 2, rt}, and let the point 
estimator 9{A„i) be the value of 6 based on {xi : i G A„i}. Also, we define 

U{An) = -. TT^ 7T J2 ^cen(A/,„ (16) 

(n — m)(7i. — m — 1) ^ '•'•"V"8(-4„,)^ 

being the complement of set A. As is constructed from m independent observations, 
J7(A,„) is an unbiased estimator of p{M,m), for m < n ~ 2. Using (16) we can define an 
unbiased estimator by averaging over all possible subsets of size m, 

p(F,,M,m)^—S2u{Am,). 

Km) 

If one wishes to estimate p{F, M, n) , then to = n — 2 might be expected to generate the 
least bias. However, as we argue at the end of this section, other choices of to might be made 
for the purposes of cither consistency or of parsimony. Also, for computational reasons one 
might wish to use only a few selected subsets of size to, as in V-fold cross-validation (van dcr 
Laan et al., 2004), or randomly selected subsets of size to (Blom, 1976). 

Note that one can construct unbiased estimators of the AIC relative risk in a simi- 
lar fashion. That is, one can estimate the first term on the right hand side of (1) as 
^ log m(9(^_ij){xi) . This estimator is unbiased for risk at sample size to = n — 1. 

2.3. Decomposition of quadratic risl< and approximate quadratic risl< estimators 

In many model selection problem, we may choose to simplify the unbiased estimates of risk, 
which can be both difficult and expensive to compute. The main difficulty is that the point 
estimators are traditionally obtained using the notoriously slow EM algorithm and therefore 
it is difficult to obtain them to sufficiently high precision in each of the many "leave-out" 
recalculations. For this reason, we derive an alternative estimate of the quadratic risk, 
based on asymptotic expansions, similar to those of the AIC derivation. 
We start with the following decomposition of the distance: 

dKiF^,Mg) = dKiF^,Me^)+[dK{Fr,Mg)-dK{Fr,Me^)]. (17) 

Here 9t denotes the parameters of the distribution (in the class of distributions denoted by 
Mg) closest in quadratic distance to the true model Fr. That is, Or—aig'cnuig dK{FT, Mg). 
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We will call the first term on the right side of (17), the model lack of fit (MLF) because 
it assesses the distance between the best model element and the truth. In particular, it is 
zero if e Mg. Notice that MLF does not depend on parameter estimation. 

We will call the second term on the right side of (17), [dK{Fr, Mg) — dxiFr, Afg^)] , the 
parameter estimation error because it measures the deviation of 9 from the best parameter 
9t-. We observe that it is always non- negative, by definition of 9^, and its magnitude 
increases as 9 deviates from 6r. In fact, it is shown in the appendix (equation (36)) that 
this term is approximately a quadratic form in {9 — 9t-), which implies that the parameter 
estimation cost is a monotonic function of (9 — 9t-). Based on the interpretation of the 
decomposition of the distance we can rewrite (17) as 

dK{Fr,Mg) = MLF + Parameter Estimation Error. (18) 

For the risk calculation we take expectations in (18) to get 

E{dK{Fr,Mg)) = MLF + PEC(^), (19) 

where PEC(^m) = E [d/f (^r, Mg) — dxiFr, Mg^)] represents the parameter estimation cost 
when 9 comes from a sample of size m. 

This decomposition is essential both for finding an approximation to the overall risk and 
for providing a useful interpretation of the different errors driving the risk. The approxi- 
mation and the estimates for the two terms, MLF and PEC, will be directly based on the 
asymptotic theory of quadratic distance (Lindsay et al., 2007) and the unbiased estimators 
of quadratic distance. 

Now we will state a few results from Lindsay et al. (2007) and use these results to justify 
our approximation and finally provide an appropriate estimator for the quadratic risk. 

2.3.1. Kernel projection operator and score centered kernel 

Crucial to these approximations of quadratic risk are the concept of scored centered kernel 
and the score based projection operators which we now define. 

For a parametric model given by Gg, with density given by g{x; 9), let us denote the set 
of score functions by V log g{xi; 9) = s{xi;9), where V is the vector differential with respect 
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to every element of 6. Note the MLE 9 is obtained as a solution to ^ s{xi; 6) = 0. Further, 
we define the extended score vector Ug{x) = (1, s(x, 6*)-^)^ and the extended information 
matrix for a single observation to be: 

We will then let P* be the kernel operator defined by 

p;{x,y)^u;ixf u;{y). (20) 

That is, Pg acts as a projection operator on the likelihood scores Ug. The score centered 
kernel Kscen(Gg){^iy)i ^-s centered under Gg, is defined to be 

K^ceniC) - {I - P*e)K{I - P;) 

= K{x, v) -JPg*{x, z)K{z, v) dGg{z) - jK{x, z)P*g{z, v)dGg{z) 

(p;{x,z)K{z,w)P;{w,v)dGe{z)dGg{w). (2f) 



Similar to the matrix form of K, we define the n x n matrix ¥g = UgJg^uJ , where Ug 
is the n X p matrix with entries de^ [log ^^(x'i)], Xi being the i*^ data point and 9j being the 
J*'' component of 6. Similarly we represent the matrix version of the scored centered kernel 
as IK5cen(e) which can be calculated as 

^sceniGo) = (I - ) ' K • (I - P^) = (I - P^) • K,,„(g,) • (I - P^). 

The first important property of score centering is the following alternative representation 
for the empirical distance between the data and the estimated model: 



dKiF^Gg) ^ JJ K,,,^(Gs){x,y) dF{x)dF{y), 

which has the following asymptotic property: 

Theorem 1. Lindsay et al. (2007) Given the regularity assumptions stated in Lind- 
say et al. (2007), under Gg we have 

^ ^ 0. (22) 



dK{F,Gg)- // Kscen(G,){x,y)dF(x)dF{y) 
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This result can then be used to show that nd^iF, Gg) has an asymptotic distribution that 
can be represented as Y,°^iXiXi, where the A^'s are the eigenvahies of the spectral decom- 
position of kernel Kscen{Ge)- ^^(^ asymptotic mean is therefore S^j^A^ = trace {Ks^^n{Ge)) 
(see Lindsay et al., 2007, for details). 

2.3.2. Approximate quadratic risk estimators 

We now return to our original purpose of approximating the different terms of our quadratic 
risk. These approximations will be based on asymptotic calculations for each model, assum- 
ing that the model is correct. This same technique is used in the AIC calculation, where 
it clearly produces great simplification in risk estimation. We start with the asymptotic 
approximation to PEC: 

Proposition 2.1. Suppose that Me has the density function me and that the true model 
Ft is in the model space Ale, having parameter Or- In this case dxiFr, Mg^) = 0. Define 

V{x,y) = jj P{x,z)K{z,w)P{w,y) Mg^{z)MeAw) 

where P{x,y) = u{x)'^ S^^u{y). Then as m, the sample size for 9, goes to infinity, 

mdK{Fr,M^) -mJJ V{x, y) F{x)F{y) 0. 

It follows that the asymptotic mean of rndxiFr, Mg) is 

treAV) = JJ V{x,y) Me^{x)MeM- 

Proof. See Appendix. 

As a simple example, for the Pearson Chi-squared kernel the PEC is simply dim{6) /m, where 
dim{6) denotes the number of parameters being estimated. In general we have to estimate 
PEC and for an estimator of trace{PgKPg), we use its empirical version iir(P^IKPp based 
on the full sample. This yields the following estimator: 



Quadratic Risk 1 5 

Now we turn to the estimation of MLF=d/^(F^, Mg^), a term not depending on hypo- 
thetical sample size m. We start with its sample equivalent given by 

1 



dK{F,M,) 



Kscen(Mjj){x, y)AF{x)dF{y) 



(24) 



This estimator has bias E{dK{F, Alg)) — MLF. To do bias correction we again use an 
asymptotic approximation assuming that ~ l^Ie^, so that MLF ~ 0. Applying the 
remarks following Theorem 1 we obtain the following approximation to the bias term: 



E{dK{F, Mg)) - MLF « -tracee^{I< 



scen{AIe^ 



))• 



The right side of (25) can be estimated by its sample equivalent 

^tr(I-Pt)K(I-Pt). 
Thus, we have the bias corrected estimator of MLF, 



MLF : 



1 



dkiF, Mg) ~ —ir{\ - Pt)K(I - Pt) 



(25) 



(26) 



(27) 



For example, in the Pearson Chi-squared case (26) reduces to (C — 1 — dim{6))/n and thus 
the MLF can be estimated as: 



r - 1^ 

^F{A,) ~ Gg{A,)^ C-l-dim{e) 



G,-(AO 

Since (C — 1 — dim{d)) gives the residual degrees of freedom for the C-cell Pearson Chi- 
squared test it is clear how the correction removes the bias inherent in the Chi-square 
limiting distribution. 

Finally combining the estimators of PEC(m) and MLF given in (23) and (27) we have 
the following estimator for the quadratic risk: 



PdK{M,m) 



4(F,Af.)--tr(I-Pt)K(I-Pt) 



-tr(PtKP*-) 



1 



1 



-tr{l ~ P|)]K(I - P*-) 



1 



-ir(P^KP| 



(28) 



Note that for the Pearson Chi-squared example the risk estimator, when calculated at 
m ~ n simply becomes: 



i:[^(a,)-g,(a)]7g,(a.)- '^-"+/^-w . 



(29) 
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Observe that the risk estimator given by (28) is numerically less expensive than the 
cross validation based exact estimator given in (16). Unlike the unbiased risk estimator, it 
does not require repeated parameter estimation. Moreover, the matrix operations involved 
in the calculation are also inexpensive, making (28) a promising risk estimator. 

For model comparison purposes we will also want to estimate the risk of the empirical 
distribution function F, in which case = /. The estimated distance dK{F,F) is zero. So 
a biased estimator of Rm, the risk of the F^ at sample size m, may be calculated using 

^ = Z^i*^ (KeniF^) ■ (30) 



nni \ 

We will later see that if we use an appropriately scaled kernel K'^ , % is exactly equal 
to the degrees of freedom corresponding to the scaled kernel A''^. In fact for the Pearson 
Chi-squared kernel = C — 1, which is the degrees of freedom for a C-cell Chi-squared 
goodness of fit test. 

In the special case that the estimated risk is evaluated at m — n we will call it the 
Quadratic AIC risk or in short QAIC, where AIC reflects commonality with the AIC deriva- 
tion. We will now show how to mimic the BIG criterion by using a different value of m. 
Recall that for the usual AIC with KL loss function, the relative MLF, after subtracting 
/ \og{fr{x))fT-{x)dx common to all models corresponds to —J log(me^)/T-(a;) (see (1)), which 
has the approximately unbiased estimator (under the model): 

mj = -2i + ^. 

n n 

The parameter estimation cost, PEC(„i) ~ dim{9)/'m, giving us the estimated risk, 

I dim(9) dimiO) 

-2- + !-^ + (31) 

n n m 

For AIC risk we use PEC(„), which gives us AlC/n ~ —2l/n + 2dim{9)/n. Note the 

similarity of this risk to the Pearson risk estimator in (29). The essential difference is that 

the latter estimates absolute risk, not relative risk. If one uses m = n/{\ogn — 1) in (31), 

one arrives at the standard BIC formula: 

BIC _ I \og{n)dim{9) 
n n n 

^ K" = K/c, where c = tr{K)/tr{K'^) 



For this reason the estimated risk (28) evaluated at m 
and can be written as 
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= n/(logn - 1) will be called QBIC, 



QBIC = MLF + (log(n) - 1) x PEC(„). 



dkiF,Mg)--triI-Vl)K{l-Fl) 



(log(n) - 1) 



(32) 



Along with a recipe for estimation of the risk, the decomposition in (28) provides an ex- 
cellent tool for deeper understanding and analysis of the interplay of model misspecification 
and the effective parameter cost of using the model. This provides us with the ingredients 
for constructing alternative model selection tools and tools for accessing global fits which 
we discuss in the following section 



2.4. Consistency and clioice of m 

It seems very natural to assess the risk of a model at the sample size m — n. This risk 
indicates how well the chosen model, along with the current parameter estimates, might 
be expected to perform when used to approximate the true distribution. Unfortunately, 
estimating the risk at m = ri is a hard problem: one cannot estimate this risk well enough 
to discriminate consistently between two true models with different numbers of parameters. 

In this section we will consider the meaning of consistency under several asymptotic 
scenarios. We will use the AIC method to illustrate our points because of its simple form. 
In all of them, we will suppose that the true distribution is in one of the models, i.e E A4. 

First, if one leaves m fixed as n — + oo, then the best model, in terms of risk, need not 
contain the true distribution, and consistency would mean selection of this (possibly false) 
best model. This might be desirable for reasons of parsimony, as the true models may be 
too complex to be useful. 

In a formal sense, methods like AIC cannot be consistent in this asymptotic scenario 
because they use asymptotic approximations for the cost of estimating parameters, and 
those approximations assume the model is true. If one replaces the exact risk with the 
asymptotic approximation, however, consistency is achieved. See, for example, the AIC in 
(31), where the MLF term is consistent for MLF, and the approximate parameter estimation 
cost is just a function of parameter dimension when m is fixed. 
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A second way to view consistency is to let m = ?t!,„ increase with n. In this case, the 
parameter estimation costs shrink to zero, and so consistency would involve selecting a true 
model. A stringent form of consistency, but natural, would require the method to select the 
smallest true model with probability approaching one. Wc will call this smallcst-truc-model 
(STM) consistency. 

It is well known that AIC, with m„ — lacks this form of consistency. When two 
true models A4i C A^2 arc nested (and regularity conditions hold), the probability that 
AIC picks the larger of the two models corresponds to using a likelihood ratio test with the 
constant critical value 2(dim(A^2) — dim(A^i)). For example if the difference of dimension 
is 1, with probability 0.16(= Pr{xi > 2)) we choose the larger model. 

It turns out that STM consistency holds when m/n goes to zero. The proportional 
case, when m„ w ^n, for constant 7 > 0, is at the cutting edge of consistency. It is easily 
checked that for the AIC criterion, it is equivalent to using the critical value (1 + 7"^) 
(dim(y\/(2) — dim(A^i)). For example if the difference of dimension is 1, the probability of 
overshooting is around .045, .025, .01, .001 for the proportion 7 = Jq respectively. 

Thus as 7 approaches zero, the probability of overshooting goes to zero. 

Although one might choose to use a BIC type criterion for its familiarity, we think it 
would be very useful to also describe the null probability of rejection for the corresponding 
likelihood ratio critical value. 

3. Model Selection using Quadratic Risk 

Now, we provide a short description of how one might use quadratic risk as a model selection 
tool. First, one needs to imdcrstand the interplay of the terms in the risk decomposition. 
Please see Figure 1 for an illustration. It shows the relevant quantities for an example in 
Modcl(3) with n = 300, described later in Section 5. Suppose wc have a sequence of models 
Mk that are nested with increasing k. As model complexity k increases the empirical 
distance dxiF ,Mk) decreases but must stay non-negative. We also know that MLF, for 
any model that contains Ft-, will tend to zero in probability. The estimated parameter 
estimation cost PEC starts near zero and increases with model complexity. However it never 
will be larger than the risk of the empirical estimate, which is the parameter estimation cost 
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Fig. 1. Illustration of (a) components of the risk measure and (b) and the risk measures correspond- 
ing to a single run of the simulation of l\/lodel(3) described in Section 5.1 .1 . 

of a non-parametric fit. By construction, the total quadratic risk is the simple addition of 
the MLF and the PEC. Thus models with low estimated risk arise as a compromise between 
the decreasing MLF and the increasing PEC. A simple model selection rule is to pick the 
model having the minimum value of QAIC. 

The QBIC form of the risk, given in (32) introduces a larger penalty for parameter 
estimation than QAIC. In Figure 1(b) one can sec how this alteration has created a much 
sharper minimum in the risk function. Both for QAIC and QBIC, we will say that the 
model with the minimum estimated risk is the best model. 

Just like AIC and BIC, quadratic risks need not attain an absolute minimum in the 
range of models considered. However, there is an important benchmark, given by the em- 
pirical risk, for evaluating the performance of a model when using quadratic risk measures. 
Note that for the empirical distribution there is no model lack of fit, because the model is 
nonparamctric, but there is the maximal possible parameter estimation cost. Models that 
don't meet this benchmark clearly suffer from substantial model lack of fit and so we will 
label them as "inadequate models," while the rest will be called "adequate" . Even in the 
presence of a minimum risk model, if all our proposed models are "inadequate" it indicates 
that more model building in the same class, or exploration in a larger class of models might 
be necessary. For example, based on the QAIC risk for proposed models and empirical 
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model in Figure 1(b), if we had explored only the 5 smallest models, then QAIC and QBIC 
would have selected model 3. However based on the MRA criterion no model would have 
been found adequate, thus providing the crucial global framework for model selection. 

On the other hand if wc have more than one "adequate models" we will call the smallest 
model in this set the minimal-risk- adequate model (henceforth denoted as MRA). Note 
that the consistency of the empirical risk (shown in Section 2.2.1)ensures that this model 
selection method would have estimated risk going to as n — > oo Now inspecting Figure 1(b) 
for the whole range of 10 components model, we see that all models from 6 onwards are 
adequate. Thus model 6 is the MRA model, even though the QAIC selects model 9 as the 
minimum risk model. 

We will provide illustrations of these criteria in the application section. 



4. Application: Selecting the number of components in mixture models 

Now we apply our quadratic risk model selection methodology to the problem of selecting 
the number of components in a multivariate normal mixture model. We start by introducing 
the following notation and definitions for mixture models. 

A random variable X e is said to follow a fc-component normal mixture model if its 
density /'■'^•' can be written as 

k 

/('=)(x;M,S,7r) 0(a;;/i„S,) for.TeM^, (33) 

where 0(a;; fij, Sj) denotes a normal density with mean fij and variance Ej, and iTj < IVj 
and ""j ~ 1- F'^'' compactness we denote 9 = {fi, S,7r}, with the above restrictions 

and denote the density in (33) as fg''^. 

According to the general framework, now M denotes the class of all _D-dimensional finite 
mixtures with normal mixing components; i.e. for a fixed D, our M = {A4i, A^2, • ■ •}, where 
A4k denotes the fc-component normal mixture models, and for a particular model element 
Mg e Mk, more precisely denoted as Mg''^ we have dMg''^ = fg'^K 

The complexity of the null distribution makes a testing theory for the number of compo- 
nents rather more difficult. It also means that the standard regularity assumptions behind 
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the derivations of the AIC and BIC are false. For a description of the previous approaches 
to selecting the number of components of mixture models see McLachlan and Peel (2000). 

4. 1 . Risk-based analysis of mixture complexity 

We now develop a strategy for selecting the number of components using our quadratic risk 
assessment approach. First, wc will specify the kernel in order to specify the loss function. 
We will then outline the steps for estimating the risk function and finally illustrate the use 
of the risk measure in choosing a mixture model. 

4.1.1. Specifying the kernel 

In principle, any nonnegative definite kernel can be used to build the distance measure. 
However, from the results in Section 2 it is clear that the key to the calculation of the 
distance, and hence the risk, arc the integrals required in the definition of the quadratic 
distance. These will involve high dimensional numerical integration for arbitrary choice 
of the kernel K. For this reason it is desirable to use kernels for which the integrations 
/ K{x,y)dM{y) and // K{x,y)dM{x)dM{y) can be carried out exphcitly. 

When the model is a finite mixture of normals, the multivariate normal kernel meets 
this goal. The £)-dimensional normal kernel, in its most general form, is defined as 



We will take S = 1t?I ^ h being a "smoothing parameter" and henceforth denote the 
corresponding kernel as K''^. This constant h can be thought of as a "tuning" parameter, 
something analogous to the bin-width in the construction of a histogram. We will discuss 
its role further in the next subsection. 

4.1.2. Role of tuning parameter "h" and its empirical estimation 

Lindsay et al. (2007) developed a tool for the understanding of the operating characteristics 
of a quadratic distance they called the spectral degrees of freedom of the kernel (sDOF). It 
is a generalization of the degrees of freedom of the chi-squared distance (itself a quadratic 
distance) to other kernels by examining their limiting distributions. With this tool one can 
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roughly equate the degree of smoothing that comes from a choice of h to that of the chi- 
squared test having the same degrees of freedom, which in turn corresponds to the number 
of cells used in its construction. The tuning parameter h is analogous to choosing the bin- 
width of each cell (or, cquivalcntly choosing the number of cells) in a Chi-squared goodness 
of fit test. 

The spectral degrees of freedom of the kernel depend on the true sampling distribution 
and the kernel K (whether centered or score centered) through the formula 

sDOF= l^j K''{x,x)dFr{x)^ I J {K''{x,x)f dFr{x). (34) 

In this paper, we will base our selection of h on the empirical estimate of sDOF given by 

s-DOF = (tr{Kl^^,^)f/tr (35) 

where the centered kernel matrix based on the empirical distribution has the ij*^ element 

^cen(F) ' ) = , ) - ^ Ei Kh {xi,Xj)-^Y.i Kh {x^ , ) + 4^ ^ . Kh {xi ,Xj). 

Just as in a Chi-squared goodness of fit test, there are choices for the spectral degrees 
of freedom that are clearly too small and others that are clearly too large. As a rough rule 
of thumb for selecting the number of cells for _D = 1 we suggest that the degrees of freedom 
should be more than 5 and less than n/5, n being the total number of observations. For 
D > 1 dimensions one needs to increase the lower bound of 5 because the goodness of fit 
test now must assess fit in several directions. We find that if sDOF is smaller than (^^^) , 
it is likely to be too much smoothing, whereas if sDOF > n/5 it would mean too little 
smoothing. See our simulation section for more on the role of the smoothing parameter on 
model selection. 

4.2. Constructing the risk estimators for mixture model selection 

Till now we have focused on the form and the smoothing parameter of the kernel for 
our problem. Using this kernel we will now illustrate the steps in estimating the risk of 
each parametric mixture model M.ki k = 1,2,... along with the risk of the empirical 
model. First, we compute the estimates of the parameters 9k for each parametric model 
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Model (1) Model (2) Model (3) Model (4) 



Fig. 2. Representative datasets from two dimensional simulation 

M.k- Simultaneously, we construct a projection matrix for each model, which is strictly 
based on the estimates 9k and the observations. We also calculate the score-centered kernel 
KgcenfM- )■ Then we calculate the risk estimate of each model using (28) and (32). 

5. Simulation results and Applications 

Now we apply our model selection tool to select the number of components in mixtures. The 
following steps were used for choosing models in all the examples of this section. First, we 
standardized the data (i.e. the variance of each variable was scaled to unity). This step 
was done because we used the same h for all the variables. In this simulation study we 
restrict ourselves to normal kernels based on the discussion in 4.1.1. Using the results in 
Section 4.1.2 we estimated the sDOF, which gives us a range of interesting smoothing 
parameter values. The simulation results reported in this section used one representative 
value from this range, though the results were quite stable across our suggested range. 
The ML estimates for the range of models under consideration are calculated using an EM 
Algorithm and the projection matrices based on this estimate. Note that for analysis at 
different levels of smoothing we only need to recalculate the distance and computationally 
simple matrix K. and not the projection matrix P^. 

5.1. Simulation experiment 

Given a mixture parameter set and a fixed sample size, we generated 100 sets of random 
samples from the true mixture distribution. We have tabulated the frequency of selection 
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Table 1. Model Selection results for a set of 2 dimensional simulated 
datasets. Boldface* indicates selection of true model. 



Estimated number of Components 





12 3 4 
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6 


7 
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9 


>10 


Model (1): 


: 2 moderately overlapping clusters, 2D, n; 


=1000 






AIC 


- 63* 16 6 


4 


4 


3 


2 


1 


1 


QAIC 


5 62* 19 2 


3 


6 


3 


- 


- 


- 


MRA 100 


- 


- 


- 


- 


- 


- 


BIG 48 52* 


_ 


_ 


_ 


- 


- 


- 


QBIG 21 77* 2 


- 


- 


- 


- 


- 


- 


Model (2): 


: 4 distinct clusters, 2D, 


n= 
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AIC 


52* 


17 
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QAIC 
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18 


6 
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1 


MRA 


- 100* 














BIG 
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QBIC 


98* 


2 












Model (3): 


: 6 distinct clusters, 2D, 


n= 


=1000 










AIC 






43* 


34 


12 


4 


7 


QAIC 






64* 


31 


9 


4 


2 


MRA 






100* 










BIG 


5 




95* 










QBIC 


1 




94* 


5 








Model (3): 


: 6 distinct clusters, 2D, 


n= 


=300 










AIC 






1* 


1 


6 


28 


64 


QAIC 




1 


28* 


38 


17 


10 


6 


MRA 


3 25 16 


13 


38* 


5 








BIG 


- 50 26 


12 


12* 










QBIC 


- 11 5 


17 


49* 


17 


1 







of the number of components by each of the 5 model selection methods (AIC, QAIC, 
MRA, BIC, and QBIC) under different examples, in an experimental design that varies by 
dimension, sample size and separateness of the components. For some example we also vary 
the range of models in the model space. 

Here, we have fixed the variance structure, so our model selection only refers to selecting 
the number of components. Readers familiar with the MCLUST software should note 
that Fraley and Raftcry (1999) provide a unified strategy to select both the number of 
components and the variance structure, but only with a single criterion, the BIC. We now 
discuss the results of our simulation experiment. 



5.1.1. Two dimensional simulation 

In our first study we held the data dimension to two and varied the normal mixture model 
as well as the sample size (for some cases). The scatterplot of a representative sample from 
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the simulation of each of the 4 models is given in Figure 2. 

First we consider simulations from the first 3 models which are generated from 2, 4 and 
6 component normal mixtures respectively. From Table 1 it is clear that AIC and QAIC 
are the more liberal methods- in the sense of favoring models larger than the true model. 
Although QBIC and BIG arc both potentially conservative, QBIC is slightly less so and 
made model selection errors on both sides of the true value. The MRA criterion (here using 
the m = n risk) was fairly conservative in some cases (e.g. Simulation 1) but otherwise 
fairly competitive. Like BIG it never overshot the true model. 

Next we explore the effect of sample size on model selection in Model (3). Decreasing the 
sample size from 1000 to 300 tended to increase the underestimation of the three conservative 
methods, but did not greatly degrade the liberal estimates of AIC and QAIC, reflecting their 
lack of consistency. 

Overall QAIC performed much better than AIC. Though QAIC overestimated the com- 
plexity, in comparison to AIC the distribution of the the number of components was much 
more peaked at the true number while the right tail tapered off very quickly. Also, if the 
components had high degree of overlap, in the sense of not displaying distinct modes for dis- 
tinct components, then QBIC performed better than BIC, which most often imdcrcstimated 
the number of components. 

Note also that the quadratic risk based methods arc more stable with change of sample 
size. In those cases where the sample size was small and BIC badly underestimated, the 
QBIC and MRA methods were distinctly better. Finally, the simulation study revealed that 
adequacy is a very useful measure. In many examples it provided extra information about 
the minimal number of components that would be needed to explain the data in hand. 
Often the minimal adequate model had a considerably smaller number of components than 
the one with minimum risk QAIC model. 

5.2. Quadratic risl< as global measure of risk 

Next we explore two examples where the true model is beyond our model space and show 
how the quadratic risk measures provides an excellent global model selection criterion in 
identifying the scenario. We have already observed that with n=1000 in Model (3) with 
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Table 2. Model selection results demonstrating global risk measure 

Model (3): 6 components, 2D, n=1000 (limited to 5 components) 



Estimated number of Components 



1 2 


3 


4 


> 5 


AIC - - 


52 


9 


39 


QAIC - - 


58 


6 


36 


MRA - - 








BIG - - 


100 






QBIC - - 


87 




13 



six distinct normal components. QAIC, QBIC, MRA and BIC all overwhelmingly select six 
components. But what happens if we explore only up to 5 components, i.e we do not include 
the true model in the model class? Table 2 summarizes the results for the 5 different criteria 
when we explore only up to 5 components. We immediately observe that BIC selects the 3 
component model in 100% of the cases. Independently our quadratic measures also selected 
this model in 52% and 87% of the cases, but the MRA criterion shows that none of the 
models arc adequate. Thus if we based our conclusion on BIC alone we would have failed to 
recognize that we arc trapped in a local minima. But if we use the quadratic risk measure 
along with the MRA criterion we are forced to explore models beyond 3 components as we 
have a clear indication that even the 5 component model is not adequate. 

We now explore an example where the model is not a normal mixture, but rather when 
the scatter of the distribution has the shape of the letter U, as given in Figure 2 (Model 
4). These datasets were simulated by first generating the x coordinate uniformly between 
-1.5 and 1.5 and then generating y uniformly between — 1 and + 1. By design, this 
density is very hard to capture with a few normal components. So we explored up to 14 
normal components. BIC and QBIC mostly choose 8-10 components. Even using the liberal 
measure QAIC, 51% of the times we choose a model with less than 14 components. But the 
MRA criterion overwhelmingly (70% of the cases) shows that even a 14 component model 
is not adequate. This suggests that we should either explore beyond 14 component model 
or we should explore beyond normal mixtures. 

In both these examples it can be easily seen that BIC, which is the most widely used 
model selection method for mixtures, does not provide the best result, simply because it 
is a relative rather than a global measure of risk. On the other hand in each of these 
simulations the MRA criterion clearly points out, that the minimum risk model does not 
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Table 3. Model selection results for the U Example (simulation (4)) 

Estimated number of Components 
1 2 3 4 5 6 7 8 9 10 11 12 13 >14 



AIC ------- - - - - 2 16 82 

QAIC ------- - 1 5 7 19 19 49 

MRA --------13783 8 

BIC ------- 16 50 24 6 3 - 1 

QBIC ------- 12 30 28 9 10 7 4 



Table 4. High dimensional simulation results 
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6 components 
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Model (7): 


6 components 


n=1000 in 12 dimensions 






AIC - 




- 12' 31 18 16 
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- - 13 


7 


80 


QAIC - 




- 49' 34 9 5 


3 
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21 
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MRA - 




10 88* 1 1 - 








10 


BIC - 
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4 63' 5 1 - 
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4 


QBIC - 


7 - 


3 81' 6 2 
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3 



lie in the model class that we are considering. 



5.2.1. High dimensional simulation 

We now examine the effect of increasing dimensions. Our 4D, 8D and 12D examples with 6 
components were generated in the following way: The first two dimensions are the same as 
2D Model (3) used in Section 5.1.1; the remaining dimensions are generated from standard 
normal variates. Like the 2D Model (3) the performance of the 5 criteria are judged under 
two scenarios (i) when the true model is in the model space (ii) and when the true model 
was not included in the search space. The results for both scenarios are given in Table 4, the 
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left panel corresponding to scenario (i), where we explore up to 10 components and the right 
panel corresponding to scenario(2) where we restrict ourselves to less than 6 components. 

First we summarize the results from scenario (i). AIC grossly overestimates the num- 
ber of components. QAIC overestimated the number of components but less so than the 
AIC, but more importantly its performance did not deteriorate much with the increase 
of dimensions. As dimension increases, BIG tends to underestimate the true number of 
components with higher percentage, and in many cases (20%, 23%, 25% for 4, 8 and 12 
dimensions respectively) BIG even chose the 3 component model. QBIC and MRA erred on 
either side of the true model, but both of them chose the true model with higher percentage 
than BIG. Moreover, when QBIG and MRA underestimated they erred slightly selecting 5 
components, while BIG selected 3 component model in many cases. 

For scenario (ii) the results are consistent with the findings of the two dimensional case, 
i.e. the various information criterion overwhelmingly suggest that the 3 component model 
is the true model, whereas the MRA criterion combined with the quadratic measures clearly 
points out (in 96%, 92%, 90% for 4, 8 and 12 dimensions respectively) that we have not 
reached a global minimum and we should expand our search space. 

These examples clearly show the usefulness of quadratic risk based methods in high 
dimensions, both for selecting the true model and providing a global infrastructure for 
model selection. 

5.3. Application to real dataset 

We apply our model selection criterion to a real dataset where the goal is to identify the 
number of gene groups displaying distinctive gene expression profile under a set of condi- 
tions. The details of the experiment is described in Ghitikila ct al. (2002) and our dataset 
was obtained through personal communication from Dr. F. Pugh of Pennsylvania State 
University. Here we present a very short description of the experimental design and pre- 
processing of the data. 

The experiment was designed to answer an interesting questions in transcriptional ge- 
nomics: Is the binding phenomenon of TATA binding proteins (TBP) affected by neighbor- 
ing TAFl proteins? The TBPs bind to the TATA box and play a crucial role in transcription 
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which finaUy results in protein synthesis. The TBP is intertwined to form a 3D structure 
in such a way that it forms a concave surface and a convex surface; the concave surface 
attaches to the TATA box, whereas the convex surface may attach to some other protein, 
for example the TAND region of TAFl proteins. To understand this interplay of TBP 
and the TAND region of TAFl, a set of 19 experiments, were designed using 6226 yeast 
genes for each strain. The two main factors that were varied in these experiment are (i) 
whether the TAFl proteins of the yeast cells had the the TAND regions intact (wild types) 
or whether the TAND regions were removed (A-TAND's), (ii) whether the concave region of 
the TBP proteins remained unmutated (wild type) or whether a yeast strain with mutation 
was used. Further 3 different types of mutations and a few controls were also used. Gene 
clusters displaying distinct gene expression profiles, one for each of the 4 combinations of 
the two levels of the two major factors would suggest a "interaction" model, whereas only 
two clusters would suggest that TBP's are not influenced by neighboring proteins. 

The analysis was done as follows. After leaving out genes which did not show significant 
change over different conditions, we analyzed n=:2809 genes under d=19 conditions (di- 
mensions), completely ignoring the information on the identity of the conditions. We used 
normal mixtures to fit the data and explored up to 8 components. The risk curves of each 
of the 4 model selection criteria is given in Figure 3. Based on the minimum risk, BIG chose 
three components where as AIG chose nine components. For calculating the quadratic risk 
we first applied the sDOF analysis detailed in Section 4.2.2 and chose a range of h. The risk 
curves given in Figure 3 are for ft, = 1.7 for which the sDOF was 226, but the minimum risk 
decision remained stable for a wide range of smoothing parameters. Based on the quadratic 
risk calculations QAIG chose six components and QBIG chose four components. Moreover 
using the MRA criterion it was clear that four components was adequate. 

There is a sense in which selecting four components is the right answer for this problem. 
Biologists conjecture that if the interaction model is true there should be 4 groups of genes 
each displaying a separate profile under the 4 conditions. Among the 3 clusters given by BIG, 
one profile matched with the distinct expression profile for A-TAND with no mutation, but 
the other two gene clusters were mixed. On the other hand the four clusters given by QBIG 
clearly identifies four distinct patterns under the major combinations, thereby providing 
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Fig. 3. Risk Calculation for the yeast gene expression dataset 



a meaningful summary and cluster analysis of this high dimensional data. Additionally 
QAIC and more prominently QBIC shows a local minima for a two components mixture. 
This suggested there might be two big groups. Further exploration shows that the two top 
clusters differentiated the mutated and unmutated TBP yeast strains. 

6. Conclusion and Future Direction 

In this paper we have developed a comprehensive tool for high dimensional model selection, 
using quadratic risk. Key to the calculation of quadratic risk is its representation and 
decompositioir using appropriate projection operators, and the estimation of these operators 
using empirical versions. Decomposing the quadratic risk into the model lack of fit and the 
parameter estimation cost enabled us to build a method, QBIC, that mimics BIG. 

One feature of the derivation of our methodology, one that separates it from the AIC 
and BIG, is that the asymptotic expansions are based on nonparametric goodness of fit 
tests, not likelihood ratio asymptotics. We believe that our unified approach for building 
and estimating quadratic risk could pave the way for designing appropriate model selection 
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criterion for a host of previously unsolved problems, especially where the irregularity of 
parameter space eliminates the asymptotic theory underlying the use of AIC, BIC and 
similar other criteria. In addition, the risk of the empirical estimate, which is a natural 
outcome of our risk analysis, provides a threshold that enables us to assess whether the 
model that was selected was itself a good fit. Unlike AIC and BIC where only the optinmm 
model is chosen, model adequacy provides us with the extra information of whether any of 
the models are adequate, or if there is a range of models that are adequate. 

In addition, we also showed that the use of an appropriate kernel could enable one to 
minimize the computational burden associated with a high-dimensional problem. Moreover, 
there is certainly more to learn about the structure of a dataset than can be revealed by 
analyzing a data with a single model selection criterion. This is an area of future research. 
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Appendix: Proof of Proposition 2.1 

We start with an asymptotic approximation to Parameter Estimation Error = [d/^ (i^,-, Af^) — (ix(^r, Afg^)] ■ 
Now we define a function in the 6'-space, by L{9t, 0) = dK{Mg, Mg^), which has the form 




K (x, y) d{Mg - Mg^){x)d{Mg - MeJ(y). 



For density function uig we assume that we can create a Taylor expansion in 6r such that 




Thus we get the following quadratic approximation for the loss 




ug^ix)Kix,y)ug^iyf Me^{x)Mg{y) {§ - er) + o{m\e - e,\^) 

(36) 



Next, under standard regularity assumptions, it is a well known result for maximum likeli- 



hood that 



m 




This shows that the error term in (36) is Op(l). In addition, substituting this expression 



into the first term gives the relationship: 




(37) 
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as needed for the proposition. Note that / V{x,y)Me^{y) = 0, so the asymptotic mean for 
mLiO-n 9) is 



E 



V{x,v) F{x)F{y) 



